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Summary 

Resampling-based expression pathway analysis techniques have been shown to preserve type I error rates, 
in contrast to simple gene-list approaches that implicitly assume the independence of genes in ranked lists. 
However, resampling is intensive in computation time and memory requirements. We describe accurate 
analytic approximations to permutations of score statistics, including novel approaches for Pearson's cor- 
relation, and summed score statistics, that have good performance for even relatively small sample sizes. 
Our approach preserves the essence of permutation pathway analysis, but with greatly reduced computa- 
tion. Extensions for inclusion of covariates and censored data are described, and we test the performance 
of our procedures using simulations based on real datasets. These approaches have been implemented in 
the new R package safeExpress. 

Keywords: Gene sets; Multiple hypothesis testing; Permutation approximation. 

1. Introduction 

A basic approach to gene expression analysis involves the detection of genes significantly associated with 
clinical outcome, treatment, or experimental design variables. This kind of analysis focuses on individual 
genes. However, researchers are often interested in the association of the clinical/treatment variable with 
sets of genes of related biological function, either to increase power or to provide a more parsimonious, set- 
based view of the results. We use the term pathway to refer generically to a gene set under study, whether 
or not the set is a true metabolic or signaling pathway. Numerous methods for pathway analysis have been 
proposed (Dinu and others, 2009), and can be divided into approaches that implicitly assume uncorre- 
lated expression data vs. those that acknowledge correlation (Barry and others, 2008). As described in 
Gatti and others (2010), approaches that acknowledge correlation via permutation have vastly superior 
type I error control properties compared with methods that assume no correlation, and include Gene Set 
Enrichment Analysis (GSEA) (Subramanian, 2005) and Significance Analysis of Function and Expression 
(SAFE) (Barry and others, 2005). The globaltest of Goeman and others (2004) offers a non-resampling 
approach, in which the correlation structures are parametrically modeled. 
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The SAFE procedure computes local statistics, which measure the association between individual genes 
and the clinical/treatment variable, and global statistics, which are aggregations of local statistics for an 
entire pathway. Local statistics can be directional, i.e. reflecting positive or negative association with the 
clinical/treatment variable, or non-directional, in which only the magnitude of association is of interest. 
Global statistics can be self-contained, meaning that local statistics are aggregated within the pathway only, 
or competitive, comparing the genes in the pathway to those in the complement (following the terminology 
of Goeman and Buhlmann, 2007). A wide variety of procedures, including GSEA, are encompassed in the 
framework, simply by varying the statistics (Barry and others, 2008). In this paper, we focus on score- 
based local statistics and global statistics based on sums and differences. No further familiarity with the 
literature on pathway analysis is assumed. 

Gene set analysis using permutation of samples dates to Virtaneva and others (2001), in which the 
clinical/treatment variable is permuted relative to the expression data. The appeal of permutation is that it 
conditions on gene expression without requiring an explicit understanding of the underlying structure. As 
described in Gatti and others (2010), proper handling of the correlation of gene expression levels among 
genes in a set is essential for error control. Once proper per-pathway ^-values are obtained, standard tech- 
niques for controlling the false discovery rate or family-wise error rate across multiple pathways may 
be used. 

A downside to permutation is that it is computationally intensive, keeping in mind that all genes must be 
examined for each permutation (for competitive global statistics). When testing numerous gene sets (which 
may reach several thousand), it may be difficult to achieve multiple test-corrected significance, unless the 
number of permutations is very large. Using standard desktop computing, performing 1000 permutations 
of the entire dataset is typical (Knijnenburg and others (2009)). Effective false-positive control across large 
numbers of pathways may not be possible, as the standard empirical />-values are constrained to a minimum 
1 /(number of permutations). 

As an alternative viewpoint, we note that correlation among local test statistics is induced by the 
correlation of expression levels among genes (Barry and others, 2008). It is thus worth exploring whether 
the permutation null distributions of global statistics can be parametrically approximated, using empirical 
estimates of correlation structure. We emphasize that actual permutation, although unbiased, can be sub- 
ject to considerable sampling variability unless the number of permutations is large. Here, we provide new 
analytic approximations, with comparisons with random permutation. For a proposed non-directional com- 
petitive global statistic, we are not aware that competing parametric approximations have been proposed. 



We suppose that each sample j — \,2, ...,n is associated with a clinical/treatment value xi, with 
the vector denoted by x, and can be discrete or continuous. We later describe approximate proce- 
dures to handle censored time-to-event data. We use g, 7 to denote the expression level for the ;'th gene 
(z = 1, 2, . . . , m) and jth sample. Let Y be the m x n normalized gene expression matrix, with y t j — 



(gij ~ gi.)/ y J2"=i (Sij ~ gi.) 2 / n > where gi. = J2j g'j/ n - This normalization produces Y?j=\ ViJ = 0 and 
yfj — 1- An important local statistic to represent the relationship between the z'th gene and the clini- 
cal/treatment variable is the score statistic, which for linear regression and a variety of generalized linear 
models can be expressed as (Schaid and Sommer, 1994) 



2. Methods 



2.1 Notation 





(2.1) 
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where s x is the sample standard deviation of the x values. Our formulation follows that of Lee and others 
(2011) (except for notational differences), who obtained results for score statistics that we use below, 
although in a different context. Score statistics have comparable asymptotic power properties as Wald 
and likelihood ratio statistics for local departures from the null, although in small-sample settings other 
statistics may have slight advantages (Harris and Peers, 1980). One source of confusion in assessing power 
arises from improper control of type I error rates, and we are motivated to consider exact testing (corre- 
sponding to exhaustive permutation) as a "gold standard". We note that, for each Si, the roles of y, vs. x 
are interchangeable, and the same score statistic applies to a number of models in which x is treated as the 
response variable and gene expression Y as the predictor. Examples include logistic or Poisson regression 
for binary or integer x. 



2.1.1 Survival analysis. For right-censored clinical data, we use the martingale residuals 

xj =Sj - A 0 (tj), 

(Therneau and Grambsch, 1 990), where <5 7 is the death (i.e. non-censored) indicator for the j th individual at 
the observed time tj and Ao is the estimated cumulative hazard. Martingale residuals have also been used in 
a version of globaltest (Goeman and others, 2005), and provide considerable computational convenience 
compared with proportional hazards regression. As x and Y are treated symmetrically in the score statistic, 
it does not matter that the censored data are more naturally considered a response value than a predictor. 



2.2 Self-contained testing 

Under the complete null hypothesis that no gene shows differential expression, it is reasonable to focus on 
each pathway while ignoring the remaining genes. We use {path} to denote a pathway with ff! path genes. 
We (Barry and others, 2008) have criticized self-contained testing for failing to account for gene effects 
not specific to a pathway. Nonetheless, self-contained testing may sometimes be reasonable. These include 
situations where (i) few genes are differentially expressed; (ii) no gene is significant when accounting for 
genome-wide multiple comparisons; or (hi) a candidate pathway is tested, where any evidence of associa- 
tion is of interest. 



2.2.1 The directional statistic U. A simple aggregation of directional local statistics 5, is U — 
Siepath which is sensitive to associations between expression and the clinical/treatment variable that 
are in the same direction. Nonetheless, testing for this statistic will generally be two-sided. Motivated by 
the central limit theorem, a normal approximation to U might seem appealing, as summation is performed 
both over n (within Si) and m pa th. However, the distribution of U is fundamentally limited by n, shown by 
rewriting 

^ " J \/epath j j 

for y'j = X);epafli y*j ( w i m me vector of these values denoted by y'). In this formulation, it is sim- 
ple to show (Appendix Al of supplementary material available at Biostatistics online) that under per- 
mutation U is one-to-one with the Pearson correlation ry between y' and permutations of x. This 
immediately suggests the density approximation for r v as Beta (I, i(n — 2)), the exact uncon- 
ditional null density if either x or y' is drawn from a normal density. Under permutation, the 
expectation of r v is the same as that predicted by the standard beta approximation. However, the 
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Fig. 1. Qqplots of p-values for rh, computed over exhaustive permutations on the salivary gland data, for the 46 
probesets in the KEGG:00271 pathway (hq = n\ = 9). 

permutation variance may differ from the standard approximation. Here, we improve the standard 
approximation by choosing beta parameters to match the first two exact permutation moments of r\j 
(Appendix Al of supplementary material available at Biostatistics online). 

2.2.2 An example. As a simple example where the standard beta approximation can fail, we consider a 
salivary gland expression dataset (ArrayExpress E-GEOD-745 1 , Affymetrix U 1 33 plus 2.0). Of 20 original 
arrays, two appeared to be accidental duplicates, and we restrict attention to the remaining 18, of which 
n i = 9 with Sojgren's syndrome were compared with no — 9 controls, and with the patient age reported. For 
the 46 probe sets (essentially genes) in pathway 0027 1 from the Kyoto Encyclopedia of Genes and Genomes 
(KEGG:00271), Methionine metabolism, the empirical r\j p-values were computed for the Q = 48 620 
unique permutations, and are guaranteed to be rank-uniform. Figure 1 shows a qqplot for the various p- 
values approaches, along with /> -values provided by a saddlepoint approximation for the dichotomous 
setting (Appendix A2 of supplementary material available at Biostatistics online). The corrected Beta has 
similar performance as the saddlepoint, but is much simpler to calculate, and its underlying approximation 
is easily incorporated into other global statistics described below. 

2.2.3 The non-directional statistic V. The squared score (local) statistics are non-directional, and a sim- 
ple non-directional global statistic to detect departures from the self-contained null is a summation of the 
squared score statistics in the pathway, 

V=Y,Sf. (2.2) 

iepath 
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This statistic is widely used in association testing of single nucleotide polymorphisms (SNPs) and genes 
(Liu, 2010). Moreover, for the expression context it is equivalent to the globaltest (Goeman and others, 
2004) when the expression data have been scaled for each gene (Pan, 2009). Goeman and others 
(2004) argued that V is optimal when aggregating small effect sizes. A standard approximation uses 
quadratic form results and moment-matching based on non-central (Liu and others, 2009), or scaled 
(Duchesne and Micheaux, 2010) x 2 - "Naive" moment estimates here are E n (V) ^ wj pa th (which would fol- 
low from a xl approximation for each Sf), and var n (F) « 2 trace(R T R), where R — Y path Y pat h is the cor- 
relation matrix for the genes in {path}. However, for small to moderate sample sizes, approaches described 
in Goeman and others (2004) can be used to correct the moments. We highlight an alternate formulation, 
described, for instance, in Lee and others (201 1): 



where Xj is the y'th eigenvalue of Y path Y pat h, and rj is the squared Pearson correlation between the y'th 
principal component of Y path and x (for j — 1, . . . , n — 1). In other words, r, = corr(p 7 , x), where p • is 
the yth column of P and P T is the right singular matrix in the singular value decomposition of Y pat h. 

Equation (2.3) shows how the rank of Y path affects V. If m path < n — 1, some of the Xj will be zero, and, 
even for large ffJ pat h, the number of contributing terms cannot exceed n — 1. Note that the orthogonality 
of the first n — 1 principal components (PCs) (for pathways with m pa th > n — 1) implies that ^ rj = 1, 
while, for constant V , (2.3) is the equation of an ellipse. An elementary comparison of these two expres- 
sions (a circle vs. an ellipse where each rj is a coordinate) implies that V cannot exceed nX\. This is 
among the reasons that / 2 -based approximations to V can fail in the extreme tail, even if the moments are 
specified correctly. 

We use (2.3) to motivate an alternative analytic approximation to the permutation distribution of V . 
The standard r 2 approximation implies that each rj ~ Beta ( \ , (n — 2) /2) , noting that any of the PCs can 
serve the role played by y' in the earlier subsection. Although the PCs are orthogonal, the {rj} are actually 
somewhat negatively correlated over permutation, as implied by results in Appendix B2 of supplementary 
material available at Biostatistics online. Unconditional multivariate normality assumptions for x and P 
imply that the {rj} follow a correlated joint beta density, related to the multiple correlation coefficient 
sampling distribution (Fisher, 1938). We modify this density for our permutation setting, where normality 
is not assured, by replacing the standard beta parameters with the exact moment-corrected versions, as 
derived earlier for r\j. Examples below show that the approach works well, even for dichotomous x and 
modest sample sizes. The approximating joint density is derived in Appendix B 1 of supplementary material 
available at Biostatistics online, but may be simply described as a recursion of successive rj. Recalling 
that rj — cor(p 7 -, y), we have the standard approximation r\ ~ Beta (j, (n — 2)/2), and show that, for 
any subset £2 C {1, — 1} that does not contain k, rj-/({l — Y^jen r j}) ~ Beta Q> ( w ~~ 1^1 ~~ 2)/2) . 
Here | £2 1 denotes the number of elements in Q,. Therefore, if r\ is generated, the remaining values can be 
drawn conditionally as r\ = B\ x (1 — r\), rf = B2 x (1 — r\ —r\), etc., where each Bj is an independent 
draw from Beta Q, (n — 1 — y)/2). Thus, V can be approximated as a sum of weighted correlated beta 
variates. The approximation reflects correlation structure that may be non-trivial for moderate sample size, 
with tails that are short enough to be realistic. The recursion applies to any ordering of the PCs, but ordering 
by eigenvalues is helpful for numeric approximation. 

We estimate />-values by numeric integration over a 2D grid of {r\, rf } for the initial term X\r\ + Xir\ 
(100 x 100, or 1000 x 1000 for more accuracy). Then a shifted gamma approximation is used for the 
remaining terms in (2.3) (Appendix B2 of supplementary material available at Biostatistics online). The 
shifted gamma distribution is an ordinary gamma with an additional location parameter, specified by 



H-l 




(2.3) 
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moment-matching from the eigenvalues. Our procedure enables testing of thousands of categories on a 
standard PC. 

Equation (2.3) is similar to the asymptotic representation of Goeman and van Houwelingen (201 1) for 
the globaltest statistic as a weighted sum of independent / 2 random variables, providing important large- 
sample motivation for our approximations below in the generalized linear model setting. In addition, the 
authors provide an alternate representation under the linear model for finite samples. We emphasize that 
our representation in (2.3) is exact over permutations, thus transparently highlighting the importance of 
the PCs. 

2.3 Competitive testing 

Competitive global test statistics contrast the local statistics within each pathway vs. those of the comple- 
mentary gene set (Goeman and Buhlmann, 2007). For datasets in which many genes are associated with 
x, this approach enables a focus on truly pathway-related findings, rather than non-specific results that 
apply to all genes. Note that permutation induces a special case of the null hypothesis in which all genes 
are null, but competitive tests remain interpretable, reflecting the correlation structure in each pathway vs. 
its complement. Barry and others (2008) discuss these issues in detail, showing that permutation may be 
slightly conservative when the competitive null holds, i.e. genes may exhibit differential expression, but 
the pathway is not enriched for such genes. 

For aggregations of directional local statistics, a straightforward competitive global statistic is 
^path/»Vth ~ £/comp/>K C omp, wnere {comp} designates the complementary set of genes not in {path}. 
However, for array studies we must consider the role of data normalization, and we note that C/ a u = 
C7path + £4omp- If the data have been normalized so that each array has the same mean expression (which is 
true for simple centering procedures or for quantile normalization), then f/ all will be constant, and C/ pat h and 
U comp will have correlation — 1 over permutations. Thus, the construction of a competitive global statistic 
would be redundant, and (7 pat h is essentially already a competitive statistic. Although this statement is not 
strictly true for other normalization procedures, it seems clear that normalization is likely to have a large 
impact on inference, and we argue that the original £/ pat h should not be further modified in an attempt to 
make it "competitive". 

2.3.1 The competitive statistic D. A natural competitive global statistic based on sums of squared score 
statistics is D — F pat h/ m p ^ — V comp /m comp . Asm comp is typically much larger than «? pat h, one might expect 
that V comp /m comp is nearly constant. In fact, for many datasets, the gene-gene correlation structures are suf- 
ficiently strong that variation in V CO m P /m comp is non-negligible. Moreover, F pat h and K comp are correlated. 
The primary barrier to approximating the distribution of D is that the form of (2.2) cannot be negative, 
involving sums of squared terms, while its equivalent (2.3) is critical to the beta-mixture approach. As we 
show in detail in Appendix C of supplementary material available at Biostatistics online, this obstacle is 
neatly overcome by adding weights w, to the score statistics {Sf}, where the weights for i e {comp} are 
imaginary, resulting in negative squared terms. Formally, we create a new weighted matrix A = WY, where 
W is the diagonal matrix with terms {w,}. Here, a new equivalent relation holds, 

m n 

D=J2 w f s f= n J2yj c p (2 - 4 > 

where {y 7 } are the eigenvalues of A T A, and cj are the observed squared correlations between the corre- 
sponding eigenvectors and x. We see that A T A is a real matrix, and so no complex algebra is required for 
the decomposition. Moreover, computation is greatly simplified by computing Y T Y once, and computing 
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A T A = Y T Y — Yp ath Y path for each pathway. Equation (2.4) shows that D may be approximated by the beta 
mixture, but with weights y, that are both positive and negative (and in fact sum to zero). 

An alternative competitive statistic based on ratios, rather than differences, is also described in 
Section 3. 

2.4 Inclusion of covariates 

Suppose z\ , Z2, ■ ■ ■ , z p are a set of n-vector covariates, any of which may be correlated with Y or x, or 
perhaps both. In principle, score statistics in the presence of covariates involve straightforward maximiza- 
tion over a restricted null space, which could be applied for each gene. However, we still need to handle 
gene-gene correlation, for which permutation is attractive. The proper handling of covariates is a chal- 
lenge in the permutation setting (Buzkova and others, 201 1). We employ one of the approaches described 

in Kennedy and Cade (1996), computing new local statistics S ia — YTj=\ x j,zyij,zl \Jj2j(( x j,z — x z ) 2 /n), 
where Y,., 2 and x z have been adjusted by linear regression for the n x p covariate matrix Z. Thus, S^ z is 
proportional to the partial correlation between y, and x after adjusting for Z. Then we compute U z , V z , 
and D z using the methods described earlier. However, we do employ a correction for the effective loss of 
p degrees of freedom (Appendix D of supplementary material available at Biostatistics online). 

Collectively, we refer to all of the proposed methods above as the safeExpress procedure, with an accom- 
panying R package. 

3. Results 
3.1 Power 

A number of studies have compared the power of various local/global tests. The connection of V to 
the widely used globaltest suggests that it should have high power for a range of alternatives. More- 
over, U and D are quite simple in their justification and connection to V and to the underlying score 
statistics. Liu and others (2007) and Dinu and others (2007) investigated a number of approaches and 
found that the globaltest was highly competitive, and several authors have discussed weaknesses in the 
GSEA Kolmogorov-Smirnov statistic (Barry and others, 2008; Dinu and others, 2007). The study of 
Fridley and others (2010) examined 10 different self-contained global statistics for a continuous x. The 
authors found that the Fisher combined /7-value (FCP, for which permutation was necessary to evalu- 
ate significance) was the most powerful practical statistic, essentially equal in power to more complex 
statistics. 

For each statistic and scenario, Fridley and others (2010) performed 1000 simulations, and 1000 per- 
mutations for each to obtain empirical />-values. In addition to the large number of statistics considered, 
these authors' efforts are notable for the range of scenarios, varying the size of the pathway (10, 50, 100 
genes), sample size (n — 20, 50, 100, 500), and average gene-gene correlation (p = 0, 0. 1 , 0.3), with each 
test performed at intended a = 0.05. Effect sizes were determined by simulating the continuous pheno- 
type as xj = fay^ + e for a coefficient vector B consisting of zeros and positive values, and normally 
distributed e. 

In all, 108 null and 2160 alternative hypothesis scenarios were simulated. Following the authors' simu- 
lation protocol, we ran safeExpress for 1000 simulations of each scenario. Figure SI of supplementary 
material available at Biostatistics online shows the type I error rates for the null scenarios. All of our 
statistics and approximations control type I error at the expected rate. For the alternative scenarios, the 
results confirm those reported by Fridley and others (2010), and are highly supportive of our approaches 
(Figure S2 of supplementary material available at Biostatistics online). Averaging across scenarios for 
each sample size, the power of U was somewhat less (4% points) than that of the FCP (Figure S2(A) of 
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Table 1. Performance of the p-value approximations for KEGG.00940 (18 genes), salivary gland 

dataset (n— 18) 



Threshold 

a 






Proposed approximation 








Continuous x 






Dichotomous x 




U 


V 


D 


U 


V 


D 


KT 1 


1.00 


0.93 


0.96 


1.00 


0.92 


0.94 


1(T 2 


1.00 


0.92 


0.94 


1.01 


0.97 


0.88 


1(T 3 


1.01 


0.90 


0.93 


0.95 


0.69 


0.84 


1(T 4 


1.02 


0.89 


0.90 


1.23 


1.22 


1.22 


1(T 5 


1.04 


0.97 


1.00 








1(T 6 


1.11 


0.59 


0.71 









Entries in the table are ratio,, = (true type I error rate)/a. For dichotomous x (no = "l = 9), p-values are limited by the number of 
possible permutations. 



supplementary material available at Biostatistics online). However, for some of the scenarios, U was more 
powerful, primarily with small samples (« = 20), low correlation (p = 0), and large cumulative effect sizes 
Y^i Pi (Figure S2(B) of supplementary material available at Biostatistics online). The power of the FCP 
was nearly identical to that of V and D (Figures S2(C) and (D) of supplementary material available at 
Biostatistics online). For D, we simulated an additional 10 000 genes in the complementary gene set with 
the same common p specified for the gene set. The complete set of null and power results are presented 
as supplementary files. While these simulations are extensive, power analyses presented elsewhere have 
already consistently supported simple aggregations of linear statistics. The overall conclusion is clear. 
Our aggregations of score statistics have competitive power with the most powerful statistics proposed by 
others, but for which significance is assessed by permutation. 

Finally, as a possible alternative to D, we assessed the power of a competitive statistic D ntl0 — 
(fpath/'Wpath)/(F C omp/'Wcomp), i-e. using ratios instead of differences. Here the significance of A-atio was 
determined by direct permutation, as we are not aware of an effective approximation. Within each of FCP 
power values near 0.1, 0.2, etc., we randomly selected five scenarios. Figure S3 of supplementary material 
available at Biostatistics online shows the power of D vs. A-atio- The statistics were comparable in power, 
although A- a tio seemed modestly less powerful in the mid-range. 

3.2 Additional investigations of type I error rate control 

We assessed the performance of our analytic approximations using annotated pathways for two datasets: 
(i) the saliva data described earlier and (ii) a breast cancer dataset from Miller and others (2005), (GSE3493 
Affymetrix U133A, where n — 251, m — 22 215). Martingale residuals for disease-free survival were avail- 
able as a continuous x for n — 236, and mutational status of p53 (—/+), a well-known tumor marker, as 
a dichotomous x (no — 198, n \ — 56). We selected a KEGG pathway for the salivary dataset and gene 
ontology (GO) pathways for the breast cancer dataset. 

Tables 1 and 2 display the type 1 error rate results for U, V, and D for the illustrative pathways. For 
dichotomous x in the salivary gland dataset, the true thresholds for type 1 error rate control were established 
by exhaustive permutation (48 620), and by 1 0 8 random permutations for other table entries. We emphasize 
that this large number of permutations for all pathways is impractical for standard use, and is used here to 
establish the accuracy of our approach. To cover the range of useful pathway testing thresholds, we display 
ratioa =(true Type I error rate)/a, for a ranging from 10~' to 10 -6 . The proposed approximations are 
reasonably accurate. 



Empirical pathway analysis, without permutation 



581 



Table 2. Performance of the p-value approximations for GO:0000184 (44 genes), with continuous breast 
cancer disease-free survival (n — 236) and dichotomous p53 status («o = 198, n\ — 56) 



Threshold 
a 






Proposed approximation 








Continuous x 






Dichotomous x 




u 


V 


D 


U 


V 


D 


KT 1 


1.00 


1.01 


1.01 


1.00 


1.01 


1.01 


io- 2 


1.00 


0.99 


0.96 


1.00 


0.99 


0.96 


1(T 3 


1.01 


0.96 


0.92 


1.01 


0.98 


0.93 


10~ 4 


1.07 


0.97 


0.82 


1.07 


1.00 


0.92 


io- 5 


1.27 


0.85 


0.84 


1.24 


1.32 


0.78 


io- 6 


1.52 


0.45 


0.76 


1.50 


0.91 


0.71 



Entries in the table are ratio, 



As V can be expressed as a quadratic form, there are competing analytic approximations that have been 
considered, not necessarily in a genomics context. Two popular approaches are based on (i) a scaled central 
X 2 distribution (Duchesne and Micheaux, 2010), for which the first two moments are typically used for 
fitting, and (ii) a non-central / 2 distribution (Liu and others, 2009), which is used in the globaltest with 
finite sample corrections. The globaltest results are shown in Table S 1 of supplementary material available 
at Biostatistics online for the salivary gland dataset, for which values less than a — 10~ 5 begin to show 
inaccuracy. For the dataset under dichotomous x, the smallest globaltest p-values are less than a — 10~ 5 , 
even though there are only 24 3 10 unique permuted V values. Using approximate / 2 results that are based 
on the naive moments are extremely poor, with true false positive rates differing by 1-2 orders of magnitude 
from that intended, and are not shown. 

To demonstrate the effectiveness of the covariate adjustment, we used KEGG:00510 (92 genes) for 
the salivary data. We consider the scenario that provides challenges for proper covariate adjustment: (i) a 
small sample size; (ii) two covariates, one which is correlated with both Y pat h and x. Recall that, in earlier 
results, the data (both expression and clinical/treatment variable) could be considered fixed, and it was 
possible for each pathway to create a single permutation distribution to which the analytic approximations 
could be compared. If the type I error rates are always controlled, conditioned on the observed data, then 
it follows that the type I error rate is also controlled unconditionally. Here, however, covariates disrupt the 
interpretation of the permutation distribution, and to establish control of the type I error rate, we condition 
on Yp a th, but generate random realizations of x and Z. For this example, covariate z\ was generated as 
0.2 x y' of Yp at h, plus an N(0, 1) error term. A dichotomous x was generated using the logistic model with 
logit(y) = z\ , and rejection sampling to ensure that n 0 — 9, n ! = 9. Covariate z 2 was generated as N(0, 1). 
After applying the covariate residualization and proper effective sample size, qqplot results for 10 000 data 
simulations show good performance of the resulting p-va\ues using our approximations for U z and V z 
(Figure S4 of supplementary material available at Biostatistics online). Failing to adjust for covariates, or 
incorrectly failing to account for the reduced rank of the weighted beta approximation, both produce highly 
inaccurate ^-values (Figure S5 of supplementary material available at Biostatistics online). Similar results 
hold for continuous x, for which an example is shown in Figure S6 of supplementary material available at 
Biostatistics online. 

To further illustrate the performance and efficiency of our methods, we analyzed the breast cancer data 
with a total of K — 6701 GO pathways, using 10 8 random permutations for the p53 status phenotype, and 
also using safeExpress. safeExpress is several orders of magnitude faster than random permutation for indi- 
vidual pathways. Direct permutation required tens of thousands of total CPU hours on a computing cluster, 
while safeExpress took at most a few hours on a standard PC (depending on the grid density; see Table S 1 
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Local statistics (on ranked scale) 

Fig. 2. Safe-plot of ranks of local S statistics, using disease-free survival for the breast cancer data, GO: 0009896. The 
plot shows an empirical cumulative distribution function of the ranks of the local score statistics Si within the pathway, 
interspersed among the ranked local statistics of genes in the complement. Genes in the pathway tend to have extreme 
positive or negative score statistics, representing poor or good prognosis when expression is high, respectively. 



of supplementary material available at Biostatistics online). Of course, 10 permutations are unnecessary 
for many pathways, as many pathways can be declared non-significant after only a few permutations. We 
show in Appendix E of supplementary material available at Biostatistics online that an adaptive permuta- 
tion procedure, designed to avoid unnecessary permutations, still requires 40+ times as much computation 
as safeExpress in order to effectively control false positives. In addition, safeExpress is designed to take 
advantage of the R multicore package to gain further efficiency from parallel processing. 

Using safeExpress, we found several highly significant GO pathways for the various x vectors: (i) p53 
mutation status in all patients; (ii) p53 mutation status for estrogen-receptor positive (ER+) tumors only 
(n =213); (iii) p53 mutation status in all patients, with covariate adjustment for ER status; and (iv) the 
continuous martingale residuals for disease-free survival. The most significant pathways, with Benjamini- 
Hochberg q -values computed for each of U, V, and D are provided as supplementary files. Here, we 
highlight just a few results from (iii) to (iv), with potential supporting literature. Results for (iii) included 
GO:0000778, Condensed Nuclear Chromosome Kinetochore (p — 5.2 x 10 -15 for U) (Chi and others, 
2009), and GO:0045767, Regulation of Anti-apoptosis (p = 8.5 x 10~ 10 for D). For (iv) (disease-free 
survival), fewer pathways were significant after multiple test correction, but included GO:0051087, Chap- 
erone Binding (p — 3 x 10~ 6 for U) (Marx and others, 2007), and GO:0009896 Positive Regulation of 
Catabolic Process (p — 1 x 10~ 7 for D) (Wallace and others, 2000), a pathway that includes IGF-1, a 
marker widely studied for its role in breast and other cancers (Chong and others, 201 1). For the last find- 
ing, a "safe-plot" (Barry and others, 2005) is shown in Figure 2, showing the ranks of local statistics within 
the pathway. 

Figure 3 compares the — log 10 (/>)-values from the 10 8 random permutations vs. the safeExpress approx- 
imations for U, V, and D across all 6701 pathways for testing p53 mutation status x. The agreement is 
close, and variation for large — \og l0 (p) is largely due to sampling variation from the random permutations, 
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Fig. 3. For the breast cancer data with a p53 dichotomous outcome, comparison between safeExpress and 10 random 
permutations on the — log W(p) scale. The numbers of pathways exceeding the >'-axis limit are shown in gray. 



not from approximation bias. Note that, with 10 8 permutations, standard permutation-based ^-values will 
not be less than 10~ 8 , even when the data indicate that the exact /(-value is much smaller. In contrast, 
safeExpress can identify and rank pathways that are much more highly significant. 



4. Discussion 

We have demonstrated that an analytic approach can provide approximations to permutation p-vahies 
for gene pathway testing, of high enough accuracy to substitute for permutation. Although confined to 
statistics that involve linear operations, our results include several novel techniques that are likely useful 
in other contexts. The corrected beta r 2 approximation applies generally to simple regression problems, 
and the weighted beta approximation applies to sums of squared score statistics, which are widely used in 
ensemble hypothesis testing. We are not aware that accurate analytic approximations to mean differences 
of squared score sums (D, in our context) have been previously proposed. 

For our speed comparisons, the random permutation approach was applied simultaneously to the entire 
expression matrix, and then summarized within each category. This an efficient approach when computing 
results for numerous categories. For a single category for i~u, it is feasible to compute a large number of 
random, though not entirely independent, permutations by computing cross-products of matrices consist- 
ing of random permutations of x and y'. Permutations of V using matrices of permuted x are also possible, 
though slower and with considerable RAM requirements. Using this approach for D (which uses all genes) 
quickly becomes unwieldy. As the number of principal components never exceeds n — 1 , another possibil- 
ity is to permute the PCs. However, the simple nature of our analytic approximations, and the relative ease 
of embedding the approximations into a software implementation, make safeExpress especially attractive. 

Extensions to our work include potential applications in SNP-set testing, sequence-based association 
analysis, and other 'omics applications involving sets of correlated statistics. For some of those appli- 
cations, the accuracy of the approximations must be demonstrated to even more stringent thresholds to 
account for an even larger number of tests. 



Supplementary material 
Supplementary material is available at http://biostatistics.oxfordjournals.org. 
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